A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels
It is universally known that caching is critical to attain high-performance
implementations: In many situations, data locality (in space and time) plays a
bigger role than optimizing the (number of) arithmetic floating-point
operations. In this paper, we show evidence that at least for linear algebra
algorithms, caching is also a crucial factor for accurate performance modeling
and performance prediction.
Comment: Submitted to the Ninth International Workshop on Automatic
Performance Tuning (iWAPT2014)
Cache-aware Performance Modeling and Prediction for Dense Linear Algebra
Countless applications cast their computational core in terms of dense linear
algebra operations. These operations can usually be implemented by combining
the routines offered by standard linear algebra libraries such as BLAS and
LAPACK, and typically each operation can be obtained in many alternative ways.
Interestingly, identifying the fastest implementation -- without executing it
-- is a challenging task even for experts. An equally challenging task is that
of tuning each routine to performance-optimal configurations. Indeed, the
problem is so difficult that even the default values provided by the libraries
are often considerably suboptimal; as a solution, normally one has to resort to
executing and timing the routines, driven by some form of parameter search. In
this paper, we discuss a methodology to solve both problems: identifying the
best performing algorithm within a family of alternatives, and tuning
algorithmic parameters for maximum performance; in both cases, we do not
execute the algorithms themselves. Instead, our methodology relies on timing
and modeling the computational kernels underlying the algorithms, and on a
technique for tracking the contents of the CPU cache. In general, our
performance predictions allow us to tune dense linear algebra algorithms to
within a few percent of the best attainable results, thus allowing computational
scientists and code developers alike to efficiently optimize their linear
algebra routines and codes.
Comment: Submitted to PMBS1
Performance Modeling and Prediction for Dense Linear Algebra
This dissertation introduces measurement-based performance modeling and
prediction techniques for dense linear algebra algorithms. As a core principle,
these techniques avoid executions of such algorithms entirely, and instead
predict their performance through runtime estimates for the underlying compute
kernels. For a variety of operations, these predictions make it possible to quickly select
the fastest algorithm configurations from available alternatives. We consider
two scenarios that cover a wide range of computations:
To predict the performance of blocked algorithms, we design
algorithm-independent performance models for kernel operations that are
generated automatically once per platform. For various matrix operations,
instantaneous predictions based on such models both accurately identify the
fastest algorithm, and select a near-optimal block size.
For performance predictions of BLAS-based tensor contractions, we propose
cache-aware micro-benchmarks that take advantage of the highly regular
structure inherent to contraction algorithms. At merely a fraction of a
contraction's runtime, predictions based on such micro-benchmarks identify the
fastest combination of tensor traversal and compute kernel.
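The prediction idea described in the abstract above (estimating the runtime of a blocked algorithm from per-kernel estimates, without running the algorithm) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the blocked Cholesky loop structure, the kernel names `t_potrf`, `t_trsm` and `t_syrk`, the constant `PEAK_FLOPS`, and the `efficiency` curve are invented placeholders, not the measured models the dissertation generates.

```python
# Illustrative sketch only: the kernel models below are invented placeholders,
# not measured models from the dissertation.

PEAK_FLOPS = 1.0e10  # assumed peak rate of the machine, in flops per second


def efficiency(b):
    """Hypothetical fraction of peak reached by a BLAS-3 kernel with block size b."""
    return b / (b + 64.0)


def t_potrf(b):
    """Assumed time of an unblocked Cholesky on a b x b block (about b^3/3 flops)."""
    return (b ** 3 / 3.0) / (0.05 * PEAK_FLOPS)


def t_trsm(m, b):
    """Assumed time of a triangular solve with an m x b panel (about m*b^2 flops)."""
    return (m * b ** 2) / (efficiency(b) * PEAK_FLOPS)


def t_syrk(m, b):
    """Assumed time of a rank-b symmetric update of an m x m block (about m^2*b flops)."""
    return (m ** 2 * b) / (efficiency(b) * PEAK_FLOPS)


def predict_runtime(n, b):
    """Sum the kernel estimates over the steps of a blocked Cholesky; nothing is executed."""
    total = 0.0
    for k in range(0, n, b):
        bk = min(b, n - k)      # size of the current diagonal block
        m = n - k - bk          # rows remaining below the diagonal block
        total += t_potrf(bk)
        if m > 0:
            total += t_trsm(m, bk) + t_syrk(m, bk)
    return total


def select_block_size(n, candidates):
    """Return the candidate block size with the smallest predicted runtime."""
    return min(candidates, key=lambda b: predict_runtime(n, b))


n = 4096
candidates = [16, 32, 64, 128, 256, 512, 1024]
best = select_block_size(n, candidates)
print(best, predict_runtime(n, best))
```

In a measurement-based setting, the three `t_*` functions would be replaced by models fitted to kernel timings on the target platform; the selection loop itself would stay the same.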
Large Scale Parallel Computations in R through Elemental
Even though in recent years the scale of statistical analysis problems has
increased tremendously, many statistical software tools are still limited to
single-node computations. However, statistical analyses are largely based on
dense linear algebra operations, which have been deeply studied, optimized and
parallelized in the high-performance-computing community. To make
high-performance distributed computations available for statistical analysis,
and thus enable large scale statistical computations, we introduce RElem, an
open source package that integrates the distributed dense linear algebra
library Elemental into R. While on the one hand, RElem provides direct wrappers
of Elemental's routines, on the other hand, it overloads various operators and
functions to provide an entirely native R experience for distributed
computations. We showcase how simple it is to port existing R programs to RElem
and demonstrate that RElem indeed makes it possible to scale beyond the single-node
limitation of R with the full performance of Elemental without any overhead.
Comment: 16 pages, 5 figures
High-Performance Solvers for Dense Hermitian Eigenproblems
We introduce a new collection of solvers - subsequently called EleMRRR - for
large-scale dense Hermitian eigenproblems. EleMRRR solves various types of
problems: generalized, standard, and tridiagonal eigenproblems. Among these,
the last is of particular importance as it is a solver in its own right, as
well as the computational kernel for the first two; we present a fast and
scalable tridiagonal solver based on the Algorithm of Multiple Relatively
Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers,
PMRRR is part of the freely available Elemental library, and is designed to
fully support both message-passing (MPI) and multithreading parallelism (SMP).
As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP
fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's
solvers on two supercomputers. Such a study, performed with up to 8,192 cores,
provides precise guidelines to assemble the fastest solver within the ScaLAPACK
framework; it also indicates that EleMRRR outperforms even the fastest solvers
built from ScaLAPACK's components.
High Performance Solutions for Big-data GWAS
In order to associate complex traits with genetic polymorphisms, genome-wide
association studies process huge datasets involving tens of thousands of
individuals genotyped for millions of polymorphisms. When handling these
datasets, which exceed the main memory of contemporary computers, one faces two
distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes
come at the cost of hundreds of gigabytes of data, which can only be kept in
secondary storage; 2) the relatedness of the test population is represented by
a relationship matrix, which, for large populations, can only fit in the
combined main memory of a distributed architecture. In this paper, by using
distributed resources such as Cloud or clusters, we address both challenges:
The genotype and phenotype data is streamed from secondary storage using a
double buffering technique, while the relationship matrix is kept across the
main memory of a distributed memory system. With the help of these solutions,
we develop separate algorithms for studies involving only one or a multitude of
traits. We show that these algorithms sustain high performance and allow the
analysis of enormous datasets.
Comment: Submitted to Parallel Computing. arXiv admin note: substantial text
overlap with arXiv:1304.227
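The double buffering mentioned in the abstract above can be sketched as follows: while one block of data is being processed, the next block is already being loaded in the background, so that at most two blocks are held at a time. This is a minimal sketch under stated assumptions, not the paper's implementation: `stream_blocks`, `load_block` and `process` are hypothetical names, and a plain Python list stands in for secondary storage.

```python
from concurrent.futures import ThreadPoolExecutor


def stream_blocks(load_block, num_blocks, process):
    """Process blocks 0..num_blocks-1, loading block i+1 while block i is processed."""
    results = []
    if num_blocks == 0:
        return results
    # A single background worker plays the role of the I/O thread.
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_block, 0)            # prefetch the first block
        for i in range(num_blocks):
            current = pending.result()                # wait until block i is in memory
            if i + 1 < num_blocks:
                pending = io.submit(load_block, i + 1)  # start loading block i+1
            results.append(process(current))          # compute on block i meanwhile
    return results


# Hypothetical stand-in for data on disk: 100 values read in blocks of 10.
data = list(range(100))
block_size = 10


def load_block(i):
    return data[i * block_size:(i + 1) * block_size]


block_sums = stream_blocks(load_block, len(data) // block_size, sum)
print(block_sums)
```

In CPython, the overlap pays off mainly when `load_block` waits on I/O (which releases the interpreter lock); the structure of the loop is what the sketch is meant to show.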
Algorithms for Large-scale Whole Genome Association Analysis
In order to associate complex traits with genetic polymorphisms, genome-wide
association studies process huge datasets involving tens of thousands of
individuals genotyped for millions of polymorphisms. When handling these
datasets, which exceed the main memory of contemporary computers, one faces two
distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds
of gigabytes of genotype data, which can only be kept in secondary storage; 2)
the relatedness of the test population is represented by a covariance matrix,
which, for large populations, can only fit in the combined main memory of a
distributed architecture. In this paper, we present solutions for both
challenges: The genotype data is streamed from and to secondary storage using a
double buffering technique, while the covariance matrix is kept across the main
memory of a distributed memory system. We show that these methods sustain
high performance and allow the analysis of enormous datasets.
Automatic Generation of Efficient Linear Algebra Programs
The level of abstraction at which application experts reason about linear
algebra computations and the level of abstraction used by developers of
high-performance numerical linear algebra libraries do not match. The former is
conveniently captured by high-level languages and libraries such as Matlab and
Eigen, while the latter expresses the kernels included in the BLAS and LAPACK
libraries. Unfortunately, the translation from a high-level computation to an
efficient sequence of kernels is a far from trivial task that requires
extensive knowledge of both linear algebra and high-performance computing.
Internally, almost all high-level languages and libraries use efficient
kernels; however, the translation algorithms are too simplistic and thus lead
to a suboptimal use of said kernels, with significant performance losses. In
order to both achieve the productivity that comes with high-level languages,
and make use of the efficiency of low-level kernels, we are developing Linnea,
a code generator for linear algebra problems. As input, Linnea takes a
high-level description of a linear algebra problem and produces as output an
efficient sequence of calls to high-performance kernels. In 25 application
problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen
and Armadillo, with speedups up to and exceeding 10x.
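One simple way to see why a simplistic translation to kernels can be costly is the order in which a product of several matrices is evaluated. The sketch below is not Linnea's algorithm; it is the textbook matrix chain dynamic program, used here only as an assumed example, and the function names `chain_cost` and `left_to_right_cost` are hypothetical. It compares the cost of a fixed left-to-right evaluation with the cheapest parenthesization, counted in scalar multiplications.

```python
def chain_cost(dims):
    """Minimum scalar multiplications for a matrix chain.

    Matrix i has shape dims[i] x dims[i + 1].
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)        # split between matrix k and k + 1
            )
    return cost[0][n - 1]


def left_to_right_cost(dims):
    """Scalar multiplications when the chain is evaluated strictly left to right."""
    total = 0
    for k in range(1, len(dims) - 1):
        total += dims[0] * dims[k] * dims[k + 1]
    return total


# A * B * v with 1000 x 1000 matrices A, B and a vector v of length 1000.
dims = [1000, 1000, 1000, 1]
print(left_to_right_cost(dims))  # (A * B) * v
print(chain_cost(dims))          # A * (B * v)
```

For this example the left-to-right order needs 1,001,000,000 scalar multiplications, while multiplying by the vector first needs 2,000,000.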